Abstract: Feature selection and feature transformation, the
two main ways to reduce dimensionality, are often presented
separately. In this paper, a feature selection method is proposed
by combining the popular transformation-based dimensionality
reduction method Linear Discriminant Analysis (LDA) with sparsity
regularization. We impose row sparsity on the transformation
matrix of LDA through $\ell_{2,1}$-norm regularization to achieve
feature selection, and the resultant formulation optimizes for
selecting the most discriminative features and removing the
redundant ones simultaneously. The formulation is extended to
the $\ell_{2,p}$-norm regularized case, which is more likely to offer
better sparsity when $0 < p < 1$ and is thus a better
approximation to the feature selection problem. An efficient
algorithm is developed to solve the $\ell_{2,p}$-norm based optimization
problem, and it is proved that the algorithm converges when
$0 < p \le 2$. Systematic experiments are conducted to understand
how the proposed method works. Promising experimental
results on various types of real-world data sets demonstrate the
effectiveness of our algorithm.

Index Terms: Feature selection, linear discriminant analysis,
$\ell_{2,p}$-norm minimization, feature redundancy.

I. Introduction 

In many applications in computer vision, data mining and
pattern recognition, data are characterized by tens or hundreds
of thousands of variables or features. High dimensionality
significantly increases the time and space requirements for
processing the data. Moreover, some features are irrelevant
or redundant. The existence of these features may result in
low efficiency, over-fitting and poor prediction performance in
learning tasks [?]-[?]. Consequently, dimensionality reduction
has become an important stage of data preprocessing in such
applications [?], [?].

Feature selection and feature transformation are the two
main ways to reduce dimensionality [1], [9]. Whereas feature
transformation methods map the original features to a
new feature subspace, feature selection selects a subset of the
original features. In contrast to feature transformation,
feature selection does not alter the original representation of
the variables. Thus, feature selection preserves the original
semantics of the variables, thereby offering the advantage
of interpretability. Another advantage of feature selection is
that only the selected features need to be collected or calculated,
while all input features are required to obtain the
low-dimensional representation in feature transformation methods.
As a result, many studies have addressed the problem of
feature selection over the past few years.

While feature selection can be applied to both supervised
and unsupervised learning, we focus on the problem of
supervised learning (classification), where the label information
is available. According to how the classification algorithm
is incorporated in evaluating and selecting features, feature
selection methods can be organized into three categories [?]:
(1) filter methods [?]-[?], where the selection is independent
of the classifiers, (2) wrapper methods [?], [?], where a feature
subset search algorithm is wrapped around the classification
model and the feature subsets are scored based on their
predictive power, and (3) embedded methods [?], [?], which
search for an optimal subset of features in the process of
classifier construction. Compared to filter methods, wrapper
methods and embedded methods are tightly coupled with a
specific classifier, so they often perform well
but at a very high computational cost. In this paper,
we focus on the filter-type methods for supervised feature
selection.

Filter-based feature selection methods can be classified
into two subtypes: (1) feature ranking (univariate techniques)
and (2) feature subset evaluation (multivariate techniques).
Filter-based feature selection utilizes the intrinsic properties
of the data to evaluate the importance of (1) each individual
feature in feature ranking methods or (2) the entire feature
subset in the case of feature subset evaluation, with respect
to (w.r.t.) a certain performance criterion. Feature
ranking methods often find suboptimal solutions for the
following two reasons: (1) The interaction among features
is neglected. Feature interaction exists if a feature forms a
subset with other ones and the subset has strong correlation
with the class [?], [?]. Evaluating features individually does
not consider the relevance of a feature subset, and features
in a relevant subset will be removed if they are low-scored.
(2) Redundant features, i.e., features with similar predictive
power, or more specifically, highly correlated features, cannot
be eliminated if they are all highly scored. In fact, many
studies [?], [?] have shown that removing redundant features
can improve the prediction accuracy. Multivariate techniques
overcome this problem to some degree. Nevertheless, both
subtypes of filter-based feature selection methods use only the
intrinsic characteristics of the data without using a learning
mechanism. Such a mechanism has proved to be powerful and has
been widely used in many areas [?]-[?].

The goal of supervised feature selection is to find the
most discriminative features that can distinguish different
classes. Thus discriminant analysis plays an important role in
supervised feature selection [?], [?], [?]. The Fisher Score
algorithm [?] is a widely applied filter-type feature selection
algorithm based on linear discriminant analysis (LDA). However,
it evaluates features individually and therefore cannot deal
with feature interaction and feature redundancy, i.e., the high
correlation among the features [?]. S. Niijima and S. Kuhara
use the Maximum Margin Criterion (MMC), a variant of LDA,
for feature selection. They recursively remove the features with the
smallest absolute values in the discriminant vectors yielded by
MMC, until the desired number of features has been removed [22].

M. Masaeli et al. propose converting LDA into a new filter-based
feature selection algorithm named Linear Discriminant
Feature Selection (LDFS) [23]. By enforcing row sparsity
on the transformation matrix of LDA through $\ell_{\infty,1}$-norm
regularization, LDFS uses both the discriminative information
and the learning mechanism. As the features are selected
jointly by a learning process, LDFS manages to optimize
for feature relevance and redundancy removal simultaneously.
However, the formulation of LDFS ignores the fact that the
transformation matrix can be scaled arbitrarily, so that it has
a trivial solution of all zeros. Thus, it loses its ability to
select features when arriving at the trivial solution.

In this paper, we first prove the existence of the trivial
solution of the formulation of LDFS. Then a new formulation
is proposed to avoid the trivial solution, in which the
transformation vectors are constrained to be uncorrelated.
Instead of utilizing $\ell_{\infty,1}$-norm regularization, we adopt
$\ell_{2,1}$-norm minimization, which can be solved by a simpler
algorithm and ensures the ability of feature selection as well. The
proposed formulation not only avoids the trivial solution, but
also inherits LDFS's merit of selecting the most discriminative
features and removing the redundant ones simultaneously.

Both the $\ell_{\infty,1}$-norm and the $\ell_{2,1}$-norm are extensions of the $\ell_1$-norm.
The $\ell_1$-norm is used most frequently to find sparse solutions because of its
convexity. In fact, using the $\ell_p$-norm ($0 < p < 1$) can find sparser
solutions than using the $\ell_1$-norm [?]-[?], but it is challenging
to solve the corresponding non-convex optimization problem.
In this paper, we generalize our formulation to the
non-convex $\ell_{2,p}$-norm regularization case, which is expected to
have better sparsity than $\ell_{2,1}$-norm minimization [26]. We
develop a simple algorithm to solve our proposed Discriminative
Feature Selection (DFS). The convergence of the algorithm is
rigorously proved for $p$ in $(0, 2]$, which covers the range we
are interested in. Our contributions are summarized as follows:

• Prove that the formulation of LDFS has a trivial solution of
all zeros;

• Propose a new formulation that avoids the trivial solution
based on $\ell_{2,1}$-norm regularization and extend it to the
$\ell_{2,p}$-norm regularized cases. When $0 < p < 1$, the
$\ell_{2,p}$-norm regularization is likely to offer better sparsity, so
the formulation is more suitable for feature selection;

• Develop an efficient algorithm to address the $\ell_{2,p}$-norm
regularized optimization problem and rigorously prove
that the algorithm monotonically decreases the objective
of DFS with $0 < p \le 2$;

• Evaluate DFS systematically on various types of real-world
data sets to understand the ability of DFS to select
discriminative features and remove redundant features.

The rest of the paper is organized as follows. Section II
states some necessary notations and definitions. A brief review
of the LDFS approach is given in Section III. In Section IV,
we introduce the formulation of DFS and provide an
efficient solution algorithm. Section V presents an analysis
of the proposed method, including convergence behavior, time
complexity and parameter selection. Experimental results on various
kinds of data sets are presented in Section VI. The conclusion and
future work are in Section VII.


II. Notations and Definitions


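We introduce the notations and the definitions of norms used
in this paper. Matrices and vectors are written as boldface
uppercase letters and boldface lowercase letters, respectively.
For a matrix $\mathbf{M} = (m_{ij})$, its $i$-th row and $j$-th column are
denoted by $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. The $\ell_p$-norm ($p > 0$) of
a vector $\mathbf{v} \in \mathbb{R}^n$ is defined as

$$\|\mathbf{v}\|_p = \Big(\sum_{i=1}^{n} |v_i|^p\Big)^{1/p},$$

and the $\ell_0$-norm is defined as

$$\|\mathbf{v}\|_0 = \sum_{i=1}^{n} |v_i|^0,$$

with the convention $0^0 = 0$, so that $\|\mathbf{v}\|_0$ counts the nonzero
entries of $\mathbf{v}$. Actually, neither $\ell_0$ nor $\ell_p$ ($0 < p < 1$) is a
valid norm, because the former does not satisfy positive scalability,
$\|\alpha\mathbf{v}\|_0 = |\alpha| \|\mathbf{v}\|_0$ for a scalar $\alpha$,
and the latter does not satisfy the triangle inequality
($\|\mathbf{u} + \mathbf{v}\|_p \le \|\mathbf{u}\|_p + \|\mathbf{v}\|_p$ fails for $0 < p < 1$). We call them norms
here for convenience.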
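As a small numerical illustration of why these two quantities are not valid norms, the following sketch evaluates them on two unit vectors. The helper names `lp_norm` and `l0_norm` are hypothetical and chosen here only for illustration; they are not part of the paper.

```python
import numpy as np

def lp_norm(v, p):
    """(sum_i |v_i|^p)^(1/p) for p > 0; a true norm only when p >= 1."""
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.abs(v) ** p) ** (1.0 / p))

def l0_norm(v):
    """Number of nonzero entries (assumes the convention 0^0 = 0)."""
    return int(np.count_nonzero(v))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

# p = 0.5: ||u + v|| = (1 + 1)^2 = 4 while ||u|| + ||v|| = 2,
# so the triangle inequality is violated.
triangle_fails = lp_norm(u + v, 0.5) > lp_norm(u, 0.5) + lp_norm(v, 0.5)

# Scaling v by 3 leaves the l0 count unchanged, so l0 is not homogeneous.
l0_not_homogeneous = l0_norm(3 * v) == l0_norm(v)
```

Both flags evaluate to true for this pair of vectors, which matches the two failure modes named in the text.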

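The $\ell_{2,1}$-norm of a matrix $\mathbf{M} \in \mathbb{R}^{n \times m}$ is defined as

$$\|\mathbf{M}\|_{2,1} = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{m} m_{ij}^2} = \sum_{i=1}^{n} \|\mathbf{m}^i\|_2. \tag{1}$$

The $\ell_{2,1}$-norm can be generalized to the $\ell_{r,p}$-norm

$$\|\mathbf{M}\|_{r,p} = \Big(\sum_{i=1}^{n} \Big(\sum_{j=1}^{m} |m_{ij}|^r\Big)^{p/r}\Big)^{1/p} = \Big(\sum_{i=1}^{n} \|\mathbf{m}^i\|_r^p\Big)^{1/p}, \quad r > 0,\ p > 0, \tag{2}$$

$$\|\mathbf{M}\|_{r,0} = \sum_{i=1}^{n} \Big(\sum_{j=1}^{m} |m_{ij}|^r\Big)^{0/r} = \sum_{i=1}^{n} \|\mathbf{m}^i\|_r^0, \quad r > 0. \tag{3}$$

Under this definition, the $\ell_{r,0}$-norm of a matrix $\mathbf{M}$ is exactly
the number of nonzero rows of $\mathbf{M}$.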
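The row-wise definitions in (1)-(3) can be checked on a small matrix. This is a minimal sketch; `lrp_norm` and `lr0_norm` are hypothetical helper names introduced here, not functions from the paper.

```python
import numpy as np

def lrp_norm(M, r=2.0, p=1.0):
    """(sum_i ||m^i||_r^p)^(1/p) over the rows m^i of M, for r > 0 and p > 0."""
    M = np.asarray(M, dtype=float)
    row_norms = np.sum(np.abs(M) ** r, axis=1) ** (1.0 / r)
    return float(np.sum(row_norms ** p) ** (1.0 / p))

def lr0_norm(M):
    """Number of nonzero rows of M (the same for every r > 0)."""
    M = np.asarray(M, dtype=float)
    return int(np.count_nonzero(np.any(M != 0, axis=1)))

M = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 5.0]])

l21 = lrp_norm(M, r=2, p=1)   # row norms are 5, 0 and 5
nonzero_rows = lr0_norm(M)    # rows 0 and 2 are nonzero
```

The zero row contributes nothing to either quantity, which is the property that later makes row-sparsity regularization act as feature selection.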

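When $r \ge 1$ and $p \ge 1$, the $\ell_{r,p}$-norm is a valid norm as
it satisfies the three norm conditions, including the triangle
inequality $\|\mathbf{A}\|_{r,p} + \|\mathbf{B}\|_{r,p} \ge \|\mathbf{A} + \mathbf{B}\|_{r,p}$. This can be
simply proved as follows. Using the triangle inequality of the $\ell_p$-norm,

$$\Big(\sum_i |u_i|^p\Big)^{1/p} + \Big(\sum_i |v_i|^p\Big)^{1/p} \ge \Big(\sum_i |u_i + v_i|^p\Big)^{1/p} \quad (p \ge 1),$$

and setting $u_i = \|\mathbf{a}^i\|_r$ and $v_i = \|\mathbf{b}^i\|_r$, we obtain

$$\Big(\sum_i \|\mathbf{a}^i\|_r^p\Big)^{1/p} + \Big(\sum_i \|\mathbf{b}^i\|_r^p\Big)^{1/p} \ge \Big(\sum_i \big(\|\mathbf{a}^i\|_r + \|\mathbf{b}^i\|_r\big)^p\Big)^{1/p} \ge \Big(\sum_i \|\mathbf{a}^i + \mathbf{b}^i\|_r^p\Big)^{1/p}, \quad r \ge 1,\ p \ge 1, \tag{4}$$

where the second inequality follows from the triangle inequality for
$\ell_r$-norms, $\|\mathbf{a}^i\|_r + \|\mathbf{b}^i\|_r \ge \|\mathbf{a}^i + \mathbf{b}^i\|_r$. Eq. (4) is just
$\|\mathbf{A}\|_{r,p} + \|\mathbf{B}\|_{r,p} \ge \|\mathbf{A} + \mathbf{B}\|_{r,p}$. However, when $0 < r < 1$ or
$0 < p < 1$, $\ell_{r,p}$ is not a valid matrix norm. Still, here we call them
norms for convenience. The notations used in this paper are
summarized in Table I.

TABLE I
Notations

Notation                                                  Description
$d$                                                       The dimensionality of the original data
$n$                                                       The data size
$c$                                                       The number of classes
$n_k$                                                     The number of data points in the $k$-th class
$l$                                                       The reduced dimensionality
$\mathcal{F}$                                             The set of selected features
$\mathbf{x}_i \in \mathbb{R}^d$                           The $i$-th data point
$\mathbf{x}_i^{(k)} \in \mathbb{R}^d$                     The $i$-th data point in the $k$-th class
$\mathbf{X} \in \mathbb{R}^{n \times d}$                  The data matrix
$\boldsymbol{\mu} \in \mathbb{R}^d$                       The total sample mean vector
$\boldsymbol{\mu}^{(k)} \in \mathbb{R}^d$                 The mean vector of the $k$-th class
$\mathbf{f}_i \in \mathbb{R}^n$                           The samples for the $i$-th feature
$\mathbf{a} \in \mathbb{R}^d$                             The transformation vector
$\mathbf{A} \in \mathbb{R}^{d \times l}$                  The transformation matrix
$\mathbf{S}_t \in \mathbb{R}^{d \times d}$                The total scatter matrix
$\mathbf{S}_w \in \mathbb{R}^{d \times d}$                The within-class scatter matrix
$\mathbf{S}_b \in \mathbb{R}^{d \times d}$                The between-class scatter matrix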


III. Linear Discriminant Feature Selection 
Revisited 

In this section, after a brief review of LDA, we introduce the 
feature selection method LDFS, which is derived from LDA. 
Then we prove that there is a trivial solution of the formulation 
of LDFS. 

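LDA is a popular supervised transformation-based dimensionality
reduction method. It seeks directions on which the
data points of different classes are far from each other, while
data points in the same class are close to each other. Suppose
we have a set of $n$ samples $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n]^T \in \mathbb{R}^{n \times d}$
belonging to $c$ classes. The objective function of LDA is as
follows [28]:

$$\mathbf{a}^* = \arg\max_{\mathbf{a}} \frac{\mathbf{a}^T \mathbf{S}_b \mathbf{a}}{\mathbf{a}^T \mathbf{S}_w \mathbf{a}}, \tag{5}$$

$$\mathbf{S}_b = \sum_{k=1}^{c} n_k (\boldsymbol{\mu}^{(k)} - \boldsymbol{\mu})(\boldsymbol{\mu}^{(k)} - \boldsymbol{\mu})^T, \tag{6}$$

$$\mathbf{S}_w = \sum_{k=1}^{c} \Big(\sum_{i=1}^{n_k} (\mathbf{x}_i^{(k)} - \boldsymbol{\mu}^{(k)})(\mathbf{x}_i^{(k)} - \boldsymbol{\mu}^{(k)})^T\Big), \tag{7}$$

where $\boldsymbol{\mu}$ is the total sample mean vector, $n_k$ is the number of
samples in the $k$-th class, $\boldsymbol{\mu}^{(k)}$ is the average vector of the
$k$-th class, and $\mathbf{x}_i^{(k)}$ is the $i$-th sample in the $k$-th class. We call
$\mathbf{S}_w$ the within-class scatter matrix and $\mathbf{S}_b$ the between-class
scatter matrix.

Define $\mathbf{S}_t = \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$ as the total scatter
matrix; then we have $\mathbf{S}_t = \mathbf{S}_b + \mathbf{S}_w$. The objective function
of LDA in (5) is equivalent to

$$\mathbf{a}^* = \arg\max_{\mathbf{a}} \frac{\mathbf{a}^T \mathbf{S}_b \mathbf{a}}{\mathbf{a}^T \mathbf{S}_t \mathbf{a}}. \tag{8}$$

When $l$ projective functions $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \cdots, \mathbf{a}_l]$ are needed,
the objective function of LDA can be written as

$$\mathbf{A}^* = \arg\max_{\mathbf{A} \in \mathbb{R}^{d \times l}} \mathrm{tr}\big((\mathbf{A}^T \mathbf{S}_t \mathbf{A})^{-1} (\mathbf{A}^T \mathbf{S}_b \mathbf{A})\big), \tag{9}$$

or

$$\mathbf{A}^* = \arg\min_{\mathbf{A} \in \mathbb{R}^{d \times l}} -\mathrm{tr}\big((\mathbf{A}^T \mathbf{S}_t \mathbf{A})^{-1} (\mathbf{A}^T \mathbf{S}_b \mathbf{A})\big). \tag{10}$$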
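The scatter matrices in (6) and (7), the identity S_t = S_b + S_w, and the trace criterion in (9) can be sketched in a few lines. This is an illustrative sketch on synthetic data; the function names `scatter_matrices` and `lda_trace_objective` are hypothetical, and the random data set is an assumption made only for the demonstration.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_b, S_w, S_t) for data X of shape (n, d) and integer labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        diff = (mu_k - mu)[:, None]
        S_b += Xk.shape[0] * (diff @ diff.T)   # Eq. (6)
        centered = Xk - mu_k
        S_w += centered.T @ centered           # Eq. (7)
    Xc = X - mu
    S_t = Xc.T @ Xc                            # total scatter
    return S_b, S_w, S_t

def lda_trace_objective(A, S_b, S_t):
    """tr((A^T S_t A)^{-1} (A^T S_b A)), the quantity maximized in Eq. (9)."""
    return float(np.trace(np.linalg.solve(A.T @ S_t @ A, A.T @ S_b @ A)))

# Synthetic example (assumed data): 3 classes, 4 features, feature 0 shifted.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.repeat([0, 1, 2], 20)
X[y == 1, 0] += 3.0
S_b, S_w, S_t = scatter_matrices(X, y)
identity_holds = bool(np.allclose(S_t, S_b + S_w))
```

Because S_t = S_b + S_w, maximizing the ratio with S_w in the denominator and maximizing it with S_t in the denominator select the same directions, which is why (5) and (8) are interchangeable.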


Before introducing the formulation of LDFS, we first analyze
what structure $\mathbf{A}$ should have to achieve feature
selection. To preserve semantic consistency, we
denote the data's $j$-th feature, i.e., the $j$-th column of $\mathbf{X}$, as
$\mathbf{f}_j$. If all the elements of the $j$-th row of $\mathbf{A}$ are zero, then
feature $\mathbf{f}_j$ makes no contribution to the low-dimensional data
representation $\mathbf{X}\mathbf{A}$ and it should be removed by the feature
selection method. On the other hand, if feature $\mathbf{f}_j$ is selected
by the feature selection algorithm, then at least one
element of the $j$-th row of $\mathbf{A}$ is nonzero. Hence, forcing
the transformation matrix $\mathbf{A}$ to have more zero rows can be
interpreted as selecting fewer features. Using this idea, M.
Masaeli et al. convert LDA into a feature selection algorithm,
LDFS, through $\ell_{\infty,1}$-norm regularization¹ [23]:

$$\min_{\mathbf{A} \in \mathbb{R}^{d \times l}} -\frac{\mathbf{A}^T \mathbf{S}_b \mathbf{A}}{\mathbf{A}^T \mathbf{S}_t \mathbf{A}} + \gamma \|\mathbf{A}\|_{\infty,1} = \min_{\mathbf{A} \in \mathbb{R}^{d \times l}} -\frac{\mathbf{A}^T \mathbf{S}_b \mathbf{A}}{\mathbf{A}^T \mathbf{S}_t \mathbf{A}} + \gamma \sum_{i=1}^{d} \|\mathbf{a}^i\|_\infty, \tag{11}$$

where $\gamma > 0$ is the parameter that tunes the row sparsity of
the transformation matrix $\mathbf{A}$. Increasing $\gamma$ means forcing more
rows to be zero, so more features will be removed. The
$\ell_\infty$-norm of the vector $\mathbf{a}^i$ is the maximum absolute value
of the elements of $\mathbf{a}^i$, and the $\ell_1$-norm induces sparsity. As
a result, the formulation of LDFS imposes sparsity on the
maximum absolute value of the elements of each row of $\mathbf{A}$,
thereby pushing all the elements of each row to zero.

To optimize for the $\ell_\infty$-norm, M. Masaeli et al. [23] adopt a
vector of dummy variables to represent the maximum absolute
value of the elements of the rows of the transformation matrix,
transforming the formulation into an optimization problem
with box constraints. A Quasi-Newton method is used to solve
this problem, but the evaluation of the gradient of
$-\frac{\mathbf{A}^T \mathbf{S}_b \mathbf{A}}{\mathbf{A}^T \mathbf{S}_t \mathbf{A}}$ is
computationally very expensive. Moreover, (11) has a trivial
solution of all zeros. We prove this in Proposition 1.

¹ The first term, $\frac{\mathbf{A}^T \mathbf{S}_b \mathbf{A}}{\mathbf{A}^T \mathbf{S}_t \mathbf{A}}$, should be converted into a
number; however, in [23] the authors do not specify which criterion of LDA
is used. Thus, we just keep the original formulation in (11).
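Proposition 1. The formulation of LDFS defined in (11) has
a trivial solution of all zeros.

Proof. Let $J(\mathbf{X}\mathbf{A}) = -\frac{\mathbf{A}^T \mathbf{S}_b \mathbf{A}}{\mathbf{A}^T \mathbf{S}_t \mathbf{A}} + \gamma \|\mathbf{A}\|_{\infty,1}$, and suppose $\mathbf{A}^*$
is a solution of (11). Then $c\mathbf{A}^*$ is a better solution of (11),
where $c$ is a nonzero constant with $|c| < 1$:

$$\begin{aligned}
J(\mathbf{X}(c\mathbf{A}^*)) &= -\frac{c^2 \mathbf{A}^{*T} \mathbf{S}_b \mathbf{A}^*}{c^2 \mathbf{A}^{*T} \mathbf{S}_t \mathbf{A}^*} + \gamma \|c\mathbf{A}^*\|_{\infty,1} \\
&= -\frac{\mathbf{A}^{*T} \mathbf{S}_b \mathbf{A}^*}{\mathbf{A}^{*T} \mathbf{S}_t \mathbf{A}^*} + \gamma |c| \|\mathbf{A}^*\|_{\infty,1} \\
&\le -\frac{\mathbf{A}^{*T} \mathbf{S}_b \mathbf{A}^*}{\mathbf{A}^{*T} \mathbf{S}_t \mathbf{A}^*} + \gamma \|\mathbf{A}^*\|_{\infty,1} \\
&= J(\mathbf{X}\mathbf{A}^*).
\end{aligned} \tag{12}$$

When $c \to 0$, which means $c\mathbf{A}^* \to \mathbf{0}$, $J(\mathbf{X}(c\mathbf{A}^*)) \to
-\frac{\mathbf{A}^{*T} \mathbf{S}_b \mathbf{A}^*}{\mathbf{A}^{*T} \mathbf{S}_t \mathbf{A}^*}$, so the all-zero matrix is a trivial solution of (11). ∎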


Recalling the desired structure of $\mathbf{A}$ for feature selection,
it is not difficult to see that the formulation loses its
ability to select features when arriving at the trivial solution.
In the experiments in [23], the value of $\gamma$ is increased until the
desired number of rows of $\mathbf{A}$, i.e., the number of features to
be removed, are close to all-zero (the maximum value of the
row is less than 0.01). In this way, the implementation can
find a nonzero solution but probably not the optimal solution.
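The scaling argument behind Proposition 1 can be observed numerically. The sketch below uses the trace form of the ratio term, which is an assumption made here so that the first term is a number (the source notes that the criterion is not specified in [23]); the matrices are synthetic stand-ins for scatter matrices and `ldfs_objective` is a hypothetical name.

```python
import numpy as np

def ldfs_objective(A, S_b, S_t, gamma):
    """-tr((A^T S_t A)^{-1} A^T S_b A) + gamma * sum_i max_j |a_ij|."""
    ratio = np.trace(np.linalg.solve(A.T @ S_t @ A, A.T @ S_b @ A))
    return float(-ratio + gamma * np.sum(np.max(np.abs(A), axis=1)))

rng = np.random.default_rng(0)
d, l = 5, 2
B = rng.normal(size=(d, d))
S_t = B @ B.T + np.eye(d)      # synthetic positive definite matrix
C = rng.normal(size=(d, 2))
S_b = C @ C.T                  # synthetic positive semidefinite matrix
A = rng.normal(size=(d, l))

# The ratio term is invariant to scaling A by c, so only the penalty changes.
values = [ldfs_objective(c * A, S_b, S_t, gamma=1.0) for c in (1.0, 0.5, 0.1, 0.01)]
strictly_decreasing = all(values[i] > values[i + 1] for i in range(len(values) - 1))
```

Shrinking the scale factor lowers the objective every time, so without a constraint that fixes the scale of the transformation matrix the minimizer drifts toward zero.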


IV. Discriminative Feature Selection Based on
$\ell_{2,p}$-Norm Regularization

As proved above, the LDFS algorithm proposed in [23]
has a trivial solution of all zeros. It may lose its function of
selecting features when it leads to a solution near the trivial
solution. In this section, a new formulation is proposed to
avoid the trivial solution. We achieve this by constraining
the transformation vectors in LDA to be uncorrelated. As the
optimization of $\ell_{\infty,1}$-norm minimization involves extensive
computation in evaluating the gradient of
$-\frac{\mathbf{A}^T \mathbf{S}_b \mathbf{A}}{\mathbf{A}^T \mathbf{S}_t \mathbf{A}}$, the $\ell_{2,1}$-norm
is adopted instead, and the resulting minimization problem
can be solved more easily. Furthermore, the formulation is
generalized to the $\ell_{2,p}$-norm regularized cases, providing more
choices of $p$ values to fit a variety of sparsity requirements.
We develop a very simple algorithm to solve the $\ell_{2,p}$-norm
minimization problem uniformly, and prove it to be convergent
when $0 < p \le 2$ in the next section. For convenience, we refer
to our proposed formulation as Discriminative Feature Selection
(DFS).


A. Discriminative Feature Selection Based on $\ell_{2,1}$-Norm Regularization

According to (8) and (10), the formulation of LDFS is
equivalent to solving the following problem:

$$\min_{\mathbf{A} \in \mathbb{R}^{d \times l}} -\mathrm{tr}\big((\mathbf{A}^T \mathbf{S}_t \mathbf{A})^{-1} (\mathbf{A}^T \mathbf{S}_b \mathbf{A})\big) + \gamma \|\mathbf{A}\|_{\infty,1}. \tag{13}$$

In general, the regularization term can be set in the form of the
$\ell_{r,1}$-norm with $1 \le r \le \infty$. In multi-task feature learning,
the choice of $r$ depends on the prior feature sharing between
the tasks, from none ($r = 1$) to complete ($r = \infty$) [?].


In fact, a larger $r$ value means allowing better "group
discounts" for sharing the same feature: $r = 1$ means linearly
growing costs with the number of tasks that use a feature,
and $r = \infty$ suggests that only the most demanding task
matters [?]. For single-task learning, increasing $r$ corresponds
to more sparsity sharing between the elements in each row of
$\mathbf{A}$: from individual element-level sparsity patterns ($r = 1$)
to row-level sparsity patterns ($r = \infty$). To perform feature
selection, we need to push $\mathbf{A}$ to have zero rows and then
remove the corresponding features. Hence, having individual
sparsity patterns, i.e., choosing $r = 1$, is not suitable for
feature selection. Thus, in order to impose row sparsity on the
transformation matrix $\mathbf{A}$ to reach the desired configuration for
feature selection, the regularization term can be set in the form
of the $\ell_{r,1}$-norm with $1 < r \le \infty$.

Here, we adopt the $\ell_{2,1}$-norm as the regularizer
for the following two reasons. First, the $\ell_{2,1}$-norm minimization
problem can be solved by an iterative algorithm [27], which
is much easier than solving the $\ell_{\infty,1}$-norm problem. Second, when
$p \to 0$, the $\ell_{2,p}$-norm and the $\ell_{\infty,p}$-norm have very similar properties,
and they both count the nonzero rows of $\mathbf{A}$ when $p = 0$. So
the formulation with $\ell_{2,1}$-norm regularization
is

$$\min_{\mathbf{A} \in \mathbb{R}^{d \times l}} -\mathrm{tr}\big((\mathbf{A}^T \mathbf{S}_t \mathbf{A})^{-1} (\mathbf{A}^T \mathbf{S}_b \mathbf{A})\big) + \gamma \|\mathbf{A}\|_{2,1}. \tag{14}$$

To avoid arbitrary scaling and the trivial solution of all
zeros, we constrain the transformation vectors of LDA to be
uncorrelated w.r.t. $\mathbf{S}_t$, i.e., $\mathbf{A}^T \mathbf{S}_t \mathbf{A} = \mathbf{I}$, as in the
uncorrelated LDA algorithm [28]. Then the formulation of $\ell_{2,1}$-norm
regularized DFS becomes

$$\min_{\mathbf{A} \in \mathbb{R}^{d \times l},\ \mathbf{A}^T \mathbf{S}_t \mathbf{A} = \mathbf{I}} -\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A}) + \gamma \|\mathbf{A}\|_{2,1}. \tag{15}$$

Once we have obtained the optimal transformation matrix $\mathbf{A}^*$,
we can rank each feature according to $\|\mathbf{a}^{*i}\|_2$ in descending
order and select the top-ranked features, where $\mathbf{a}^{*i}$ is the $i$-th
row of $\mathbf{A}^*$.
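The ranking step can be sketched as follows. The function name `select_features` and the toy matrix are hypothetical and serve only to show the row-norm ranking described above.

```python
import numpy as np

def select_features(A, num_features):
    """Indices of the num_features rows of A with the largest l2 norm, plus all scores."""
    scores = np.linalg.norm(A, axis=1)
    order = np.argsort(-scores, kind="stable")   # descending row norm
    return order[:num_features], scores

# Toy transformation matrix with d = 4 features and l = 2 projections (assumed values).
A_star = np.array([[0.9, -0.1],
                   [0.0,  0.0],
                   [0.2,  0.3],
                   [1e-6, 0.0]])
selected, scores = select_features(A_star, num_features=2)
```

Rows that are exactly or nearly zero receive the lowest scores, so the corresponding features are the first to be dropped.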

The formulation of DFS in (15) not only avoids the trivial
solution but also preserves the advantage of the original
LDFS: it can optimize for feature relevance and redundancy
removal automatically [23]. More specifically, the features are
selected by using the linear transformation matrix $\mathbf{A}^*$, which
is learned by the algorithm, so the combinations of these
features can lead to the optimal value of the objective of
LDA. Hence, DFS selects the most discriminative features,
and the interactions among the features are also taken into
consideration. Furthermore, the proposed formulation discards
redundant features automatically. Adding redundant features,
i.e., features that are correlated with the discriminative features,
to the selected feature subset will not decrease the value
of $-\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A})$, but will increase $\|\mathbf{A}\|_{2,1}$ and hence increase
the objective value of DFS. Therefore, in the process of
minimizing the objective function, the redundant features will
be eliminated automatically by DFS.
B. $\ell_{2,p}$-Norm Regularized Discriminative Feature Selection

Recall that selecting fewer features means forcing the
transformation matrix $\mathbf{A}$ to have more zero rows. Thus, the
exact formulation of LDA-based feature selection is

$$\min_{\mathbf{A} \in \mathbb{R}^{d \times l},\ \mathbf{A}^T \mathbf{S}_t \mathbf{A} = \mathbf{I}} -\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A}) + \gamma \|\mathbf{A}\|_{2,0}, \tag{16}$$

which is an $\ell_{2,0}$-norm minimization problem. However,
problem (16) is difficult to solve as it is an NP-hard combinatorial
optimization problem.

Extensive computational studies have shown that using the
$\ell_p$-norm ($0 < p < 1$) can find sparser solutions than using the $\ell_1$-norm
[24], [25]. Naturally, one expects $\ell_{2,p}$-norm ($0 < p < 1$)
based minimization to yield a better sparsity pattern than the
$\ell_{2,1}$-norm. The experimental results in [26] show that $\ell_{2,p}$-norm
minimization for some $0 < p < 1$ does find sparser solutions
than $\ell_{2,1}$-norm minimization. Thus, the NP-hard feature
selection problem can be relaxed to the following problem:

$$\min_{\mathbf{A} \in \mathbb{R}^{d \times l},\ \mathbf{A}^T \mathbf{S}_t \mathbf{A} = \mathbf{I}} -\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A}) + \gamma \|\mathbf{A}\|_{2,p}^p, \quad p > 0. \tag{17}$$

Obviously, this formulation reduces to (15) when $p = 1$.

It can be easily seen that the value of $p$ balances the
sparsity and the convexity of the regularization term. When
$p = 0$, the regularizer is ideal for feature selection in the
sense of producing sparse solutions, but it is not convex. On
the other hand, the $\ell_{2,1}$-norm is the closest convex approximation
to the $\ell_{2,0}$-norm but the sparsity is weakened. In other words,
the closer the value of $p$ is to 0, the better approximation the
formulation is to the feature selection problem.
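The effect of $p$ on the regularizer can be seen on a toy matrix: as $p$ shrinks, the penalty approaches the number of nonzero rows regardless of how large or small those rows are. The helper `l2p_penalty` and the example matrix are hypothetical illustrations.

```python
import numpy as np

def l2p_penalty(A, p):
    """sum_i ||a^i||_2^p, i.e. ||A||_{2,p}^p; zero rows contribute nothing."""
    row_norms = np.linalg.norm(A, axis=1)
    nonzero = row_norms[row_norms > 0]
    return float(np.sum(nonzero ** p))

# Two nonzero rows with very different magnitudes and one zero row (assumed values).
A = np.array([[0.1, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
penalties = {p: l2p_penalty(A, p) for p in (1.0, 0.5, 0.1, 0.01)}
```

At p = 1 the penalty is dominated by the large row, while for small p both nonzero rows cost almost the same amount, close to one unit each, which mimics counting selected features.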

Since (17) involves $\ell_{2,p}$-norm regularization, it is hard
to derive its closed-form solution directly. In [27], an iterative
algorithm has been proposed to solve the joint $\ell_{2,1}$-norm
minimization problem of both the regression loss function
and the regularizer. The convergence of the algorithm is also
proved in the same work. A similar technique is used
in [?] to minimize the Schatten $p$-norm with $0 < p \le 2$ for
matrix completion. Inspired by these two works, we develop
a simple unified algorithm to solve our proposed DFS for $p$ in
$(0, 2]$, which will be introduced in the next subsection.


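C. $\ell_{2,p}$-Norm Minimization Algorithm

In this subsection, we present a simple unified algorithm to
solve our proposed formulation for both the convex regularized
case ($1 \le p \le 2$) and the non-convex regularized case
($0 < p < 1$).

For convenience, we denote $\mathcal{L}(\mathbf{A}) = \|\mathbf{A}\|_{2,p}^p$. Note that the
derivative of $\mathcal{L}(\mathbf{A})$ w.r.t. $\mathbf{A}$ is

$$\frac{\partial \mathcal{L}(\mathbf{A})}{\partial \mathbf{A}} = 2\mathbf{D}\mathbf{A}, \tag{18}$$

where $\mathbf{D} \in \mathbb{R}^{d \times d}$ is a diagonal matrix with the $i$-th diagonal
element as

$$d_{ii} = \frac{p}{2} \|\mathbf{a}^i\|_2^{p-2}. \tag{19}$$

When $\mathbf{D}$ is fixed, the derivative of the objective in (17) can also be regarded
as the derivative of the following objective function:

$$-\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A}) + \gamma\, \mathrm{tr}(\mathbf{A}^T \mathbf{D} \mathbf{A}). \tag{20}$$

Consequently, the problem in (17) can be addressed by solving
the following problem iteratively:

$$\min_{\mathbf{A} \in \mathbb{R}^{d \times l},\ \mathbf{A}^T \mathbf{S}_t \mathbf{A} = \mathbf{I}} -\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A}) + \gamma\, \mathrm{tr}(\mathbf{A}^T \mathbf{D} \mathbf{A}), \tag{21}$$

where $\mathbf{D}$ is defined as in (19).

Rewriting (21), we get

$$\min_{\mathbf{A} \in \mathbb{R}^{d \times l},\ \mathbf{A}^T \mathbf{S}_t \mathbf{A} = \mathbf{I}} \mathrm{tr}\big(\mathbf{A}^T (\gamma \mathbf{D} - \mathbf{S}_b) \mathbf{A}\big). \tag{22}$$

Solving problem (22) is equivalent to finding the $l$ eigenvectors
associated with the $l$ smallest eigenvalues of the following
generalized eigen-problem:

$$(\gamma \mathbf{D} - \mathbf{S}_b)\mathbf{a} = \lambda \mathbf{S}_t \mathbf{a}. \tag{23}$$

Note that $\mathbf{D}$ is dependent on $\mathbf{A}$ and thus $\mathbf{D}$ is also an
unknown variable. We propose an iterative algorithm in this
paper to obtain the solution $\mathbf{A}$ such that (22) is satisfied, and
prove in the next section that the proposed iterative algorithm
monotonically decreases the objective of the problem in
(17).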

The algorithm is described in Algorithm 1. In each iteration, 
A is calculated with the current D, and then D is updated 
based on the current calculated A. The iteration procedure is 
repeated until the algorithm converges. 


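Algorithm 1 DFS

Input: Data matrix $\mathbf{X}$, label information, parameters $\gamma$, $l$, $p$.

Output: $\mathbf{A} \in \mathbb{R}^{d \times l}$.

1: Compute the scatter matrices $\mathbf{S}_t$ and $\mathbf{S}_b$;

2: Set $k = 0$. Initialize $\mathbf{D}_0$ as an identity matrix;

Repeat

3: Solve the generalized eigen-problem $(\gamma \mathbf{D}_k - \mathbf{S}_b)\mathbf{a} = \lambda \mathbf{S}_t \mathbf{a}$;

4: $\mathbf{A}_{k+1} = [\mathbf{a}_1, \mathbf{a}_2, \cdots, \mathbf{a}_l]$, where $\mathbf{a}_1, \mathbf{a}_2, \cdots, \mathbf{a}_l$ are the
eigenvectors associated with the $l$ smallest eigenvalues;

5: Calculate the diagonal matrix $\mathbf{D}_{k+1}$, where the $i$-th
diagonal element is $\frac{p}{2} \|\mathbf{a}_{k+1}^i\|_2^{p-2}$;

6: $k = k + 1$;

Until convergence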


Remark 1. To get a stable solution of the generalized eigen-problem
(23), $\mathbf{S}_t$ is required to be nonsingular. This is clearly
not true when the number of features is larger than the number
of samples. We apply the idea of regularization, adding
a constant to the diagonal elements of $\mathbf{S}_t$ to obtain $\mathbf{S}_t + \alpha \mathbf{I}$
for some $\alpha > 0$. It is easy to see that $\mathbf{S}_t + \alpha \mathbf{I}$ is nonsingular.


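Remark 2. When computing $\mathbf{D}$, its diagonal element is
$d_{ii} = \frac{p}{2} \|\mathbf{a}^i\|_2^{p-2}$. In practice, $\|\mathbf{a}^i\|_2$ could be very close to zero but
not zero. However, $\|\mathbf{a}^i\|_2$ can be zero theoretically. In this
case, $d_{ii} = 0$ is a subgradient of $\|\mathbf{A}\|_{2,p}^p$ w.r.t. $\mathbf{a}^i$. We cannot
set $d_{ii} = 0$ when $\mathbf{a}^i = \mathbf{0}$, otherwise the derived algorithm will
not be guaranteed to converge. Instead, we regularize $d_{ii}$ as

$$d_{ii} = \frac{p}{2} \big((\mathbf{a}^i)^T \mathbf{a}^i + \zeta\big)^{\frac{p-2}{2}},$$

and the derived algorithm can be proved to minimize
$-\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A}) + \gamma \sum_{i=1}^{d} \big((\mathbf{a}^i)^T \mathbf{a}^i + \zeta\big)^{p/2}$ instead
of $-\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A}) + \gamma \sum_{i=1}^{d} \big((\mathbf{a}^i)^T \mathbf{a}^i\big)^{p/2} = -\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A}) +
\gamma \|\mathbf{A}\|_{2,p}^p$. It is easy to see that $\sum_{i=1}^{d} \big((\mathbf{a}^i)^T \mathbf{a}^i + \zeta\big)^{p/2}$
approximates $\|\mathbf{A}\|_{2,p}^p$ when $\zeta \to 0$.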
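The iteration of Algorithm 1, together with the two remarks, can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `dfs`, the default values of `alpha`, `zeta`, `max_iter` and `tol`, and the stopping rule on the objective are all choices made here. The generalized eigen-problem is solved by Cholesky whitening of the regularized total scatter matrix, which is one standard way to reduce it to an ordinary symmetric eigen-problem.

```python
import numpy as np

def dfs(S_b, S_t, l, gamma=1.0, p=1.0, alpha=1e-6, zeta=1e-8, max_iter=50, tol=1e-8):
    """Iteratively reweighted sketch of Algorithm 1; returns (A, objective history)."""
    d = S_t.shape[0]
    # Remark 1: shift S_t by alpha*I so that it is nonsingular, then factor it once.
    L = np.linalg.cholesky(S_t + alpha * np.eye(d))
    L_inv = np.linalg.inv(L)
    D = np.eye(d)                          # step 2: D_0 = I
    history = []
    for _ in range(max_iter):
        # Step 3: (gamma*D - S_b) a = lambda (S_t + alpha*I) a, whitened to a
        # symmetric standard eigen-problem via a = L^{-T} y.
        M = L_inv @ (gamma * D - S_b) @ L_inv.T
        M = (M + M.T) / 2.0                # remove round-off asymmetry
        _, vecs = np.linalg.eigh(M)        # eigenvalues come in ascending order
        A = L_inv.T @ vecs[:, :l]          # step 4: A^T (S_t + alpha*I) A = I
        # Remark 2: smoothed squared row norms keep d_ii finite for zero rows.
        sq = np.sum(A * A, axis=1) + zeta
        history.append(float(-np.trace(A.T @ S_b @ A) + gamma * np.sum(sq ** (p / 2.0))))
        D = np.diag((p / 2.0) * sq ** ((p - 2.0) / 2.0))   # step 5
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
    return A, history
```

The recorded history is the smoothed objective from Remark 2, so it is the quantity expected to be non-increasing across iterations. Features can then be ranked by the row norms of the returned matrix.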

V. Discussions 

This section gives an analysis of DFS in three aspects. 
We first provide the convergence behavior of the algorithm 
and then discuss time complexity and parameter determination 
problems. 


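A. Algorithm Analysis

In this subsection, we prove that the objective function of
(17) is non-increasing under the updating rules of $\mathbf{A}$ and $\mathbf{D}$
in Algorithm 1. Firstly, the following lemma is introduced.

Lemma 1. When $0 < p \le 2$, for any nonzero vectors $\mathbf{a}$, $\mathbf{a}_k$,
the following inequality holds:

$$\frac{\|\mathbf{a}\|_2^p}{\|\mathbf{a}_k\|_2^p} - \frac{p}{2} \frac{\|\mathbf{a}\|_2^2}{\|\mathbf{a}_k\|_2^2} \le 1 - \frac{p}{2}. \tag{24}$$

Proof. Denote $\varphi(t) = t^p - \frac{p}{2} t^2 + \frac{p}{2} - 1$; then we have

$$\varphi'(t) = p t^{p-1} - p t = p t (t^{p-2} - 1).$$

Obviously, when $t > 0$ and $0 < p < 2$, $t = 1$ is the only
point such that $\varphi'(t) = 0$. Note that $\varphi'(t) > 0$ for $0 < t < 1$
and $\varphi'(t) < 0$ for $t > 1$, so $t = 1$ is the maximum point. As
$\varphi(1) = 0$, we have $\varphi(t) \le 0$ when $t > 0$ and $0 < p < 2$; for
$p = 2$, $\varphi(t) = 0$ for all $t$.
Therefore, letting $t = \frac{\|\mathbf{a}\|_2}{\|\mathbf{a}_k\|_2}$ in $\varphi(t)$, we have
$\varphi\big(\frac{\|\mathbf{a}\|_2}{\|\mathbf{a}_k\|_2}\big) \le 0$, that is
to say

$$\frac{\|\mathbf{a}\|_2^p}{\|\mathbf{a}_k\|_2^p} - \frac{p}{2} \frac{\|\mathbf{a}\|_2^2}{\|\mathbf{a}_k\|_2^2} + \frac{p}{2} - 1 \le 0.$$

After a transposition, we arrive at (24). ∎

Theorem 1. When $0 < p \le 2$, Algorithm 1 will
monotonically decrease the objective of the problem in (17)
in each iteration, and converge to the local optimum of the
problem.

Proof. In the $k$-th iteration,

$$\mathbf{A}_{k+1} = \arg\min_{\mathbf{A} \in \mathbb{R}^{d \times l},\ \mathbf{A}^T \mathbf{S}_t \mathbf{A} = \mathbf{I}} -\mathrm{tr}(\mathbf{A}^T \mathbf{S}_b \mathbf{A}) + \gamma\, \mathrm{tr}(\mathbf{A}^T \mathbf{D}_k \mathbf{A}), \tag{25}$$

which indicates that

$$-\mathrm{tr}(\mathbf{A}_{k+1}^T \mathbf{S}_b \mathbf{A}_{k+1}) + \gamma\, \mathrm{tr}(\mathbf{A}_{k+1}^T \mathbf{D}_k \mathbf{A}_{k+1}) \le -\mathrm{tr}(\mathbf{A}_k^T \mathbf{S}_b \mathbf{A}_k) + \gamma\, \mathrm{tr}(\mathbf{A}_k^T \mathbf{D}_k \mathbf{A}_k). \tag{26}$$

That is to say,

$$-\mathrm{tr}(\mathbf{A}_{k+1}^T \mathbf{S}_b \mathbf{A}_{k+1}) + \gamma \sum_{i=1}^{d} \frac{p}{2} \frac{\|\mathbf{a}_{k+1}^i\|_2^2}{\|\mathbf{a}_k^i\|_2^{2-p}} \le -\mathrm{tr}(\mathbf{A}_k^T \mathbf{S}_b \mathbf{A}_k) + \gamma \sum_{i=1}^{d} \frac{p}{2} \frac{\|\mathbf{a}_k^i\|_2^2}{\|\mathbf{a}_k^i\|_2^{2-p}}, \tag{27}$$

where vectors $\mathbf{a}_k^i$ and $\mathbf{a}_{k+1}^i$ denote the $i$-th row of matrix
$\mathbf{A}_k$ and $\mathbf{A}_{k+1}$, respectively. On the other hand, according to
Lemma 1, for each $i$ we have

$$\frac{\|\mathbf{a}_{k+1}^i\|_2^p}{\|\mathbf{a}_k^i\|_2^p} - \frac{p}{2} \frac{\|\mathbf{a}_{k+1}^i\|_2^2}{\|\mathbf{a}_k^i\|_2^2} \le 1 - \frac{p}{2}, \tag{28}$$

which, after multiplying both sides by $\|\mathbf{a}_k^i\|_2^p$, is equivalent to the following inequality:

$$\|\mathbf{a}_{k+1}^i\|_2^p - \frac{p}{2} \frac{\|\mathbf{a}_{k+1}^i\|_2^2}{\|\mathbf{a}_k^i\|_2^{2-p}} \le \|\mathbf{a}_k^i\|_2^p - \frac{p}{2} \frac{\|\mathbf{a}_k^i\|_2^2}{\|\mathbf{a}_k^i\|_2^{2-p}}, \tag{29}$$

so the following inequality holds:

$$\sum_{i=1}^{d} \Big(\|\mathbf{a}_{k+1}^i\|_2^p - \frac{p}{2} \frac{\|\mathbf{a}_{k+1}^i\|_2^2}{\|\mathbf{a}_k^i\|_2^{2-p}}\Big) \le \sum_{i=1}^{d} \Big(\|\mathbf{a}_k^i\|_2^p - \frac{p}{2} \frac{\|\mathbf{a}_k^i\|_2^2}{\|\mathbf{a}_k^i\|_2^{2-p}}\Big). \tag{30}$$

Combining (27) and (30), i.e., adding $\gamma$ times (30) to (27), we have

$$-\mathrm{tr}(\mathbf{A}_{k+1}^T \mathbf{S}_b \mathbf{A}_{k+1}) + \gamma \sum_{i=1}^{d} \|\mathbf{a}_{k+1}^i\|_2^p \le -\mathrm{tr}(\mathbf{A}_k^T \mathbf{S}_b \mathbf{A}_k) + \gamma \sum_{i=1}^{d} \|\mathbf{a}_k^i\|_2^p. \tag{31}$$

That is to say,

$$-\mathrm{tr}(\mathbf{A}_{k+1}^T \mathbf{S}_b \mathbf{A}_{k+1}) + \gamma \|\mathbf{A}_{k+1}\|_{2,p}^p \le -\mathrm{tr}(\mathbf{A}_k^T \mathbf{S}_b \mathbf{A}_k) + \gamma \|\mathbf{A}_k\|_{2,p}^p. \tag{32}$$

Thus Algorithm 1 will monotonically decrease the objective
of the problem in (17) in each iteration $k$. Note that the
objective function is bounded from below, so the above iteration will
converge. Therefore, Algorithm 1 monotonically decreases
the objective value in each iteration until convergence. ∎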
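Lemma 1 is the step that links the reweighted surrogate to the original objective, so it can be useful to sanity-check inequality (24) numerically. The sketch below samples random vector pairs for several values of p; the helper name `lemma1_gap` and the sampling setup are assumptions made for illustration, and a numerical check is of course not a substitute for the proof.

```python
import numpy as np

def lemma1_gap(a, a_k, p):
    """Right side minus left side of (24); Lemma 1 states this is >= 0 for 0 < p <= 2."""
    t = np.linalg.norm(a) / np.linalg.norm(a_k)
    return float((1.0 - p / 2.0) - (t ** p - (p / 2.0) * t ** 2))

rng = np.random.default_rng(0)
gaps = [lemma1_gap(rng.normal(size=3), rng.normal(size=3), p)
        for p in (0.1, 0.5, 1.0, 1.5, 2.0)
        for _ in range(200)]
min_gap = min(gaps)
```

The gap depends only on the ratio of the two norms and vanishes when the norms are equal, which is why the bound is tight at a fixed point of the iteration.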

As we use the transformation matrix $\mathbf{A}$ to select features,
we also need to make clear its convergence
behavior. Following [?], we measure the divergence between two
consecutive iterates of $\mathbf{A}$ by the following metric:

$$\mathrm{Div}(k) = \sum_{i=1}^{d} \Big| \|\mathbf{a}_{k+1}^i\|_2 - \|\mathbf{a}_k^i\|_2 \Big|. \tag{33}$$

The metric defined above acts as an indicator of whether
the final results would be changed drastically.
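The divergence metric in (33) compares only row norms of consecutive iterates, so it can be computed in one line. The function name `divergence` and the two toy iterates are hypothetical.

```python
import numpy as np

def divergence(A_next, A_prev):
    """Div(k): sum over rows of | ||a^i_{k+1}||_2 - ||a^i_k||_2 |."""
    return float(np.sum(np.abs(np.linalg.norm(A_next, axis=1)
                               - np.linalg.norm(A_prev, axis=1))))

A_k = np.array([[3.0, 4.0],
                [0.0, 1.0]])
A_k1 = np.array([[0.0, 5.0],
                 [0.0, 0.5]])
div = divergence(A_k1, A_k)   # |5 - 5| + |0.5 - 1|
```

Because only row norms enter the metric, it is unaffected by sign flips or rotations of the eigenvectors between iterations, and it tracks exactly the quantity used to rank features.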


B. Time Complexity 

To optimize the objective function of DFS, the most time-consuming
operation is to solve the generalized eigen-problem
$(\gamma \mathbf{D}_k - \mathbf{S}_b)\mathbf{a} = \lambda \mathbf{S}_t \mathbf{a}$. The time complexity of this operation is
approximately $O(d^3)$. Empirical results show that the convergence
is fast and only several iterations are needed to converge.
Therefore, the proposed method scales well in practice.
C. Parameter Selection

Parameter selection is of great importance and it is still an
open problem. At present, the most commonly used parameter
selection method is grid search based on the cross-validation
accuracy (CV-Acc), i.e., searching for the parameter value
corresponding to the highest CV-Acc. Sometimes, we also
determine parameters based on experience [?].

If we take $p$ in the $\ell_{2,p}$-norm regularization term as a
parameter, then DFS has three parameters: the reduced
dimensionality $l$, the regularization parameter $\gamma$ and $p$. As for the
reduced dimensionality $l$, we empirically set it to $c - 1$
as in traditional LDA [28], where $c$ is the number of classes.
The regularization parameter $\gamma$ controls the trade-off between
discrimination and sparsity. It plays an important role
in DFS for feature selection. We determine it by grid search
according to CV-Acc, and some numerical results are presented
to illustrate its impact on DFS. The value of $p$, which
balances the sparsity and convexity of the formulation, is also
hard to decide. As the transformation matrix needs to be
row-sparse, we only focus on cases where $0 < p \le 1$, though
the algorithm is proved to be convergent when $0 < p \le 2$.
To simplify the experiments, $p$ is set to 1 when comparing
with other feature selection approaches, and the performance
of DFS with different $p$ values is studied separately. Finally,
the number of selected features, a common parameter for all
feature selection methods, is difficult to determine without
prior knowledge. Hence, we vary this parameter within a certain range
and report the corresponding results.

VI. Experiments 

In this section, experiments are conducted to evaluate the 
performance of our proposed algorithm. First, a toy example 
is presented to show the ability of DFS to find the 
discriminative features. Then we compare DFS with five widely 
used filter-type feature selection methods, followed by 
comparisons between DFS with different $p$ values. 
We also evaluate how DFS performs with varying values of 
the regularization parameter $\gamma$. Finally, convergence analysis 
and computational time are reported. 

A. Data Set Description and Evaluation Metrics 

In our experiments, six diverse public data sets are used 
to illustrate the performance of different feature selection 
approaches. These data sets include three image data sets, 
COIL20², ORL³ and USPS⁴; two biological gene expression 
microarray data sets, Colon Tumor⁵ (COLON) and Lung 
Cancer⁶ (LUNG); and one spoken letter recognition data set, 
ISOLET⁷. All data sets are standardized to be zero-mean and 
normalized by standard deviation. We summarize the statistics 
of the data sets in Table II and briefly introduce them as 
follows. 

² http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php 

³ http://www.zjucadcg.cn/dengcai/Data/FaceData.html 

⁴ http://www.cc.gatech.edunsong/data/icml_data.zip 

⁵ http://www.upo.es/eps/bigs/datasets.html 

⁶ https://sites.google.com/site/feipingnie/resoure 

⁷ https://archive.ics.uci.edu/ml/datasets/ISOLET 


TABLE II 

Data Set Description 

| Data set | Size | # of Features | # of Classes | Type |
|----------|------|---------------|--------------|------|
| COIL20   | 1440 | 1024 | 20 | Image, Object |
| ORL      | 400  | 1024 | 40 | Image, Face |
| USPS     | 730  | 256  | 10 | Image, Handwritten |
| COLON    | 62   | 2000 | 2  | Microarray, Biological |
| LUNG     | 203  | 3312 | 5  | Microarray, Biological |
| ISOLET5  | 1559 | 617  | 26 | Voice, Alphabet |


• COIL20 contains 1440 images of 20 objects. The images 
of each object were taken 5 degrees apart as the object 
was rotated on a turntable, so each object has 72 images. 
The size of each image is 32 × 32 pixels, with 256 gray 
levels per pixel. Thus, each image is represented by a 
1024-dimensional vector. 

• ORL consists of 400 face images. There are 10 different 
images of each of 40 distinct subjects. For some subjects, 
the images were taken at different times with varying 
lighting, different facial expressions and facial details. 
The original size of each image is 92 × 112, with 256 gray 
levels per pixel. In our experiments, the size is reduced 
to 32 × 32. 

• The original USPS handwritten digit database contains 
9298 images. The size of each image is 16 × 16 pixels 
and each image is characterized by a 256-dimensional 
vector. The version we use in this paper is a balanced 
random sample of the original data set produced by L. 
Song et al. [32]. 

• COLON contains expression levels of 2000 genes taken 
from 62 different samples. For each sample it is indicated 
whether it comes from a tumor biopsy or not. 40 samples 
are from tumor biopsies and the remaining 22 samples are 
normal. 

• LUNG is composed of 203 samples in five classes with 
139, 21, 20, 6 and 17 samples respectively. Each sample has 
12600 genes. The genes with standard deviations smaller 
than 50 expression units were removed, and the remaining 
data set contains 203 samples with 3312 genes. 

• The ISOLET data set was generated by letting 150 
subjects speak the name of each letter of the alphabet twice. 
Hence, from each speaker we have 52 training examples. 
The data was divided into 5 equal sets of 30 speakers 
each. We use the data from the fifth part, ISOLET5. As 
one example in this part is missing, ISOLET5 has 1559 
examples in 26 classes with 617 attributes. 

To test the quality of the selected features, two metrics are 
used: accuracy, the classification accuracy achieved by the 
classifier using the selected features; and redundancy rate, the 
redundancy contained in the selected features. An ideal 
feature selection algorithm should select features that result in 
high accuracy, while containing few redundant features El. 

We use the LIBSVM⁸ software to perform classification, 
which implements the “one-against-one” approach for 
multi-class cases; see more details in 13^. Following lIZTl, ll^, the 
SVM classifier is individually performed on each data set with 

⁸ https://github.com/cjlin1/libsvm 














Fig. 1. A toy example. Top: The test sample of ORL data from the 18th class 
with different numbers of selected features; Bottom: The test sample of ORL 
data from the 38th class with different numbers of selected features. 


the selected features, using the linear kernel with the parameter 
C = 1 and 5-fold cross validation. The average classification 
accuracy of all these 5 folds is reported as the final result. 
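
The evaluation protocol (keep only the selected features, run 5-fold cross validation, average the fold accuracies) can be sketched as below. This is an illustration under stated assumptions rather than the setup used in the paper: the classifier is a nearest-centroid stand-in, not the linear SVM from LIBSVM, and all function names are hypothetical.

```python
import random
from typing import List, Sequence


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle range(n) and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier (NOT the linear SVM used in the paper)."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda y: dist2(x, centroids[y]))
            for x in test_x]


def cross_val_accuracy(xs: Sequence[Sequence[float]], ys: Sequence[int],
                       selected: Sequence[int], k: int = 5,
                       predict=nearest_centroid_predict,
                       seed: int = 0) -> float:
    """Average accuracy over k folds using only the columns in `selected`."""
    reduced = [[x[j] for j in selected] for x in xs]
    fold_accs = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        preds = predict([reduced[i] for i in train],
                        [ys[i] for i in train],
                        [reduced[i] for i in fold])
        correct = sum(p == ys[i] for p, i in zip(preds, fold))
        fold_accs.append(correct / len(fold))
    return sum(fold_accs) / k
```

Any classifier with the same `predict(train_x, train_y, test_x)` shape can be passed in place of the stand-in. The sketch assumes at least `k` samples so that no fold is empty.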

Assume $\mathcal{F}$ is the set of selected features, and $X_{\mathcal{F}}$ is the 
data represented by the features in $\mathcal{F}$. We use the following 
measurement to measure the redundancy rate of $\mathcal{F}$ iflTl : 

$$\mathrm{RED}(\mathcal{F}) = \frac{1}{|\mathcal{F}|(|\mathcal{F}|-1)} \sum_{f_i, f_j \in \mathcal{F},\, i > j} \mathrm{corr}_{i,j}, \qquad (34)$$

where $|\mathcal{F}|$ is the cardinality of $\mathcal{F}$, i.e., the number of selected 
features, and $\mathrm{corr}_{i,j}$ is the correlation between two features, 
$f_i$ and $f_j$. This measurement assesses the averaged correlation 
of all feature pairs in $\mathcal{F}$. A large value indicates that many 
selected features are correlated, and thus redundancy is expected 
to exist in $\mathcal{F}$. 
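
A minimal sketch of how such a redundancy rate can be computed is given below. It rests on two assumptions that should be checked against the definition of Eq. (34): the correlation is taken to be the Pearson coefficient, and the sum over feature pairs with i > j is divided by |F|(|F| - 1). The function names are illustrative only.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equally long feature columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # convention chosen here for constant columns
    return cov / math.sqrt(var_a * var_b)


def redundancy_rate(columns: Sequence[Sequence[float]]) -> float:
    """Sum of pairwise correlations (i > j) divided by m * (m - 1).

    `columns` holds one list of sample values per selected feature.
    The normalisation is an assumption, see the lead-in.
    """
    m = len(columns)
    if m < 2:
        return 0.0
    total = sum(pearson(columns[i], columns[j])
                for i, j in combinations(range(m), 2))
    return total / (m * (m - 1))
```

Under this normalization, two perfectly correlated features give a rate of 0.5, because each unordered pair is counted once while the denominator counts ordered pairs.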


B. A Toy Example 

In this subsection, we present a toy example to illustrate the 
ability of DFS to select discriminative features. Specifically, 
two samples from each class of the ORL data set are randomly 
selected as the training data, and the remaining samples 
are used for testing. We run our method on the training 
data and select the top-ranked {32, 64, 128, 256, 384, 512, 
640, 768, 896, 1024} features in turn. Then, the images 
of two randomly selected testing samples are recovered 
using each number of selected features, from 32 up to all 1024. 
For illustration, the unselected features are set to white 
and the selected features keep their original values. The 
recovered images are displayed in Fig. 1; from left to right, 
the number of selected features is {32, 64, 128, 256, 384, 512, 
640, 768, 896, 1024}. 
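
The recovery step can be illustrated with a short sketch. It assumes that each image is a flat list of raw gray levels, that white corresponds to the value 255 (the source only states 256 gray levels per pixel), and that `ranking` lists feature indices from most to least important; the names are hypothetical.

```python
from typing import List, Sequence

WHITE = 255  # assumed gray level used to paint unselected pixels


def mask_unselected(image: Sequence[float], ranking: Sequence[int],
                    k: int) -> List[float]:
    """Keep the k top-ranked pixels and paint all other pixels white."""
    keep = set(ranking[:k])
    return [v if j in keep else WHITE for j, v in enumerate(image)]
```

For a 32 × 32 image, `mask_unselected(image, ranking, 64)` would reproduce the second panel of each row, with the result reshaped to 32 × 32 for display.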

From Fig. 1 we draw the following conclusions. The 
recovered images show that DFS preferentially selects features 
corresponding to the eyes, nose and mouth, which are the most 
discriminative features for describing each individual. 
With only 64 features selected, the 
eyes, nose and mouth of each testing sample are already clear. 
The skin pixels, in contrast, act as background and are 
dropped by our method in most cases. We also note that 
the features selected by DFS are well distributed and do not 
cluster in one part of the face. This indicates that DFS 
manages to remove redundant features. 


C. Comparison between DFS and Other Filter-type Feature 
Selection Algorithms 

Tuning the values of $\gamma$ and $p$ simultaneously would 
complicate the experiments. For simplicity, we set 
$p = 1$ for DFS when comparing it with other algorithms, 
since the effectiveness of the $\ell_{2,1}$-norm in feature selection has been 
demonstrated in many studies lfT9l, lIZTl, Il3^. The effect 
of $p$ on the performance of DFS is studied separately 
in the next subsection. As LDFS has a trivial solution and the 
implementation proposed in ll 2 ^ is hard to reproduce, we 
do not make a comparison with LDFS. We compare the DFS 
algorithm with the following widely used filter-type feature 
selection algorithms: 

• BAHSIC [32], which is a backward elimination feature 
selection method that employs the Hilbert-Schmidt 
Independence Criterion (HSIC) as a measure of dependence 
between the features and the labels. 

• Laplacian Score (LS) IIT3II, which evaluates the features 
according to their ability to preserve the local manifold 
structure. 

• mRMR na, which selects features that are mutually 
far away from each other and have “high” correlation 
to the classification variable according to the 
minimal-redundancy-maximal-relevance criterion based on mutual 
information. 

• ReliefF (RF) HI, which evaluates features based on 
how well each feature differentiates between neighboring 
instances from different classes versus from the same 
class. 

• Trace Ratio (TR) llTSll, which selects a feature subset 
based on the corresponding subset-level score that is 
calculated in a trace ratio form. 

For BAHSIC, a linear kernel is used on both data and 
labels. We need to tune the bandwidth parameter for the 
Gaussian kernel in Laplacian Score, and decide the number 
of nearest neighbors used per class in ReliefF. For DFS 
($p = 1$), the reduced dimensionality $l$ is set to $c - 1$, as in 
traditional LDA [28]. We also need to tune the regularization 
parameter $\gamma$. We decide the undetermined parameters in a 
heuristic way by grid search. For Trace Ratio, the weight 
matrices are constructed in the same manner as Fisher Score 
(refer to llT5l for details). The implementations of BAHSIC, 
Laplacian Score and Trace Ratio are downloaded from the 
authors’ websites. The code for mRMR and ReliefF is from 
the ASU Feature Selection Repository⁹. We set the number of 
selected features between 10 and 100 with an interval of 5 for 
all data sets. Each feature selection algorithm is first run 
to select features. Then the LIBSVM software is employed to 
classify samples represented by the selected features, using a 
linear kernel and 5-fold cross validation. We report the average 
classification accuracy over the 5 folds as the final result. 

Fig. 2 shows the classification accuracy computed by the 
SVM classifier on six data sets using different feature selection 
algorithms. Table III and Table IV show the detailed 
experimental results using the top 20, 40, 60 and 80 features respectively. 
The last line of both tables is the average accuracy over all the 

⁹ http://featureselection.asu.edu/index.php 
Fig. 2. Comparison between DFS and other filter-type feature selection methods w.r.t. classification accuracy. 


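TABLE III 

Classification Accuracy (%) of SVM Using 5-Fold Cross Validation for Top 20 and 40 Features Respectively 

Average accuracy of top 20 features (%) 

| Data set | BAHSIC | LS | mRMR | RF | TR | DFS |
|----------|--------|-------|-------|-------|-------|-------|
| COIL20   | 46.81 | 75.56 | 76.04 | 65.21 | 75.76 | 93.54 |
| ORL      | 63    | 82.50 | 76.50 | 76    | 70.75 | 88    |
| USPS     | 62    | 69.32 | 75.89 | 71.37 | 70.55 | 77.81 |
| COLON    | 79.03 | 83.87 | 80.65 | 82.26 | 79.03 | 93.55 |
| LUNG     | 87.19 | 91.63 | 93.60 | 88.67 | 82.27 | 97.04 |
| ISOLET5  | 50.61 | 60.04 | 75.18 | 57.60 | 60.04 | 75.63 |
| Average  | 64.77 | 77.15 | 79.64 | 73.52 | 73.07 | 87.60 |

Average accuracy of top 40 features (%) 

| Data set | BAHSIC | LS | mRMR | RF | TR | DFS |
|----------|--------|-------|-------|-------|-------|-------|
| COIL20   | 64.44 | 85.14 | 89.10 | 72.50 | 86.04 | 97.50 |
| ORL      | 76.25 | 89.75 | 80.25 | 87.50 | 88.50 | 94.50 |
| USPS     | 74    | 78.63 | 80.55 | 78.08 | 80.68 | 86.58 |
| COLON    | 75.81 | 77.42 | 80.65 | 75.81 | 75.81 | 100   |
| LUNG     | 94.58 | 94.58 | 94.09 | 95.07 | 90.15 | 97.54 |
| ISOLET5  | 56.19 | 71.52 | 79.47 | 67.09 | 71.52 | 87.49 |
| Average  | 73.55 | 82.84 | 84.02 | 79.34 | 82.12 | 93.94 |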


data sets for each feature selection approach. Fig. 3 shows 
the corresponding redundancy rate when different numbers of 
features are selected by different feature selection algorithms. 

As shown in Fig. 2, the trend of classification accuracy as the 
number of selected features increases varies across 
feature selection methods and data sets. 
For the data sets COIL20, ORL and ISOLET5, all feature selection 
approaches achieve higher classification accuracy with more 
selected features. A similar tendency can be found on the 
USPS data set, with only BAHSIC’s performance fluctuating 
widely. On the COLON data set, the classification accuracy 
achieved by each method fluctuates within a certain range. On 
the LUNG data set, DFS and mRMR level off at about 97.00% 
and 94.50% respectively, while the classification accuracy of 
the other methods increases with fluctuation. 

Interestingly, on the data sets COIL20, USPS and ISOLET5, the 
classification accuracies achieved by Laplacian Score and Trace 
Ratio are approximately the same. The reason may be that 
with a special graph structure, Laplacian Score is equivalent 
to Fisher Score, and the weight matrices in Trace Ratio are also 
constructed in the Fisher LDA manner. On the data sets 
ORL, COLON and LUNG, however, Laplacian Score surpasses Trace 
Ratio. 

In terms of classification accuracy, DFS outperforms all the 
baseline methods on all data sets most of the time. 
In particular, on the COLON data set, DFS achieves an improvement 
of 8.07 to 19.35 percentage points over the best result of all 
the other methods. We compute the average classification 
accuracy over all data sets for each method using the top 
20, 40, 60 and 80 features respectively. On average, DFS 
consistently outperforms the other five methods. 
The mRMR algorithm performs second best when 20 
and 40 features are selected, and Laplacian Score takes its place 
when 60 and 80 features are used. Compared with mRMR 
or Laplacian Score, DFS improves the average accuracy by 7.96, 
9.92, 7.31 and 6.22 percentage points respectively. 
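
The four gaps quoted above can be reproduced as plain differences between the "Average" rows of Tables III and IV. The short check below copies those rows; the dictionary layout and the function name are illustrative only.

```python
# "Average" rows of Tables III and IV, keyed by number of selected features.
averages = {
    20: {"BAHSIC": 64.77, "LS": 77.15, "mRMR": 79.64,
         "RF": 73.52, "TR": 73.07, "DFS": 87.60},
    40: {"BAHSIC": 73.55, "LS": 82.84, "mRMR": 84.02,
         "RF": 79.34, "TR": 82.12, "DFS": 93.94},
    60: {"BAHSIC": 78.60, "LS": 87.80, "mRMR": 87.65,
         "RF": 84.38, "TR": 86.73, "DFS": 95.11},
    80: {"BAHSIC": 81.43, "LS": 89.23, "mRMR": 88.62,
         "RF": 87.94, "TR": 88.22, "DFS": 95.45},
}


def gain_over_runner_up(row):
    """Return (best baseline, DFS average minus that baseline's average)."""
    others = {name: acc for name, acc in row.items() if name != "DFS"}
    runner_up = max(others, key=others.get)
    return runner_up, round(row["DFS"] - others[runner_up], 2)


for k, row in averages.items():
    print(k, *gain_over_runner_up(row))
```

The runner-up is mRMR for 20 and 40 features and LS for 60 and 80 features, with gaps of 7.96, 9.92, 7.31 and 6.22, measured in accuracy percentage points.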

From Fig. 3, we can see that the feature subsets selected by 
DFS consistently have a low redundancy rate on all data 
sets. DFS selects features whose combination can lead to 
directions where data points from different classes are far 
TABLE IV 

Classification Accuracy (%) of SVM Using 5-Fold Cross Validation for Top 60 and 80 Features Respectively 

Average accuracy of top 60 features (%) 

| Data set | BAHSIC | LS | mRMR | RF | TR | DFS |
|----------|--------|-------|-------|-------|-------|-------|
| COIL20   | 77.57 | 87.99 | 91.46 | 78.40 | 90.07 | 98.33 |
| ORL      | 80    | 92.75 | 82.50 | 91.50 | 91.50 | 96.25 |
| USPS     | 74    | 83.70 | 85.34 | 84.38 | 85.21 | 91.10 |
| COLON    | 82.26 | 87.10 | 90.32 | 80.65 | 82.26 | 98.39 |
| LUNG     | 94.09 | 94.58 | 95.07 | 93.60 | 90.64 | 97.04 |
| ISOLET5  | 63.69 | 80.69 | 81.21 | 77.74 | 80.69 | 89.54 |
| Average  | 78.60 | 87.80 | 87.65 | 84.38 | 86.73 | 95.11 |

Average accuracy of top 80 features (%) 

| Data set | BAHSIC | LS | mRMR | RF | TR | DFS |
|----------|--------|-------|-------|-------|-------|-------|
| COIL20   | 84.79 | 90.56 | 94.58 | 83.75 | 93.61 | 98.75 |
| ORL      | 81.25 | 95.75 | 84    | 92.50 | 93.25 | 94.75 |
| USPS     | 78    | 86.71 | 88.08 | 86.58 | 87.40 | 91.10 |
| COLON    | 83.87 | 87.10 | 85.48 | 85.48 | 82.26 | 100   |
| LUNG     | 94.09 | 94.09 | 95.57 | 96.06 | 91.63 | 97.04 |
| ISOLET5  | 66.58 | 81.14 | 84.03 | 83.26 | 81.14 | 91.08 |
| Average  | 81.43 | 89.23 | 88.62 | 87.94 | 88.22 | 95.45 |
Fig. 3. The redundancy rate of the sets of features of different size selected by different feature selection methods. 


from each other. Similarly, ReliefF’s evaluation criterion is to 
select features that contribute to the separation of the samples 
from different classes. However, ReliefF shows no advantage 
in handling feature redundancy. 

One point should be highlighted here. As seen from the 
results, in most cases, the redundancy rate of the selected feature 
subset decreases as the number of selected features increases. 
This seems counter-intuitive. Redundancy stems from 
the inter-correlation between the selected features, and thus the 
total amount of redundancy should increase when the selected 
feature subset is enlarged. However, the redundancy rate is 
calculated by averaging the feature-feature correlation 
coefficients. It indicates the mean redundancy level of the 
selected feature subset, not the total amount of redundancy. 
Therefore, the redundancy rate of the enlarged feature subset 
can be higher or lower than that of the original one. Provided 
that the classification accuracy can be guaranteed, the main 
goal of feature selection is to reduce the redundancy as well 
as the number of features. In practice, the number of selected 
features is predetermined, so feature selection algorithms are 
generally designed to seek high classification accuracy 
and low redundancy. Compared with the other methods, DFS 
achieves this goal more successfully. 
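
The distinction between total and mean redundancy can be made concrete with a small worked example. The correlation values below are invented purely for illustration, and the mean is taken per unordered feature pair, which differs from a rate normalized by $|\mathcal{F}|(|\mathcal{F}|-1)$ only by a constant factor.

```latex
% Hypothetical subset {f_1, f_2} with corr_{1,2} = 0.8:
\text{total} = 0.8, \qquad \text{mean per pair} = \frac{0.8}{1} = 0.8 .

% Add f_3 with corr_{1,3} = corr_{2,3} = 0.2 (three pairs in total):
\text{total} = 0.8 + 0.2 + 0.2 = 1.2, \qquad
\text{mean per pair} = \frac{1.2}{3} = 0.4 .
```

The total amount of correlation grows from 0.8 to 1.2, while the mean level falls from 0.8 to 0.4, which is the behavior observed in most of the redundancy curves.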

Combining Fig. 2 and Fig. 3, we see that there is no 
definite relationship between a feature subset’s discriminative 
power and its redundancy rate. That is to say, a feature subset 
with higher discriminative power does not necessarily have 
lower redundancy and vice versa. 

In summary, DFS, which combines discriminant analysis 
and $\ell_{2,p}$-norm regularization, can enhance the feature selection 
performance in terms of classification. There are two main 
reasons for this. First, DFS selects features jointly through 
a learning mechanism, so the interactions among the 
whole set of features are considered. Second, the optimization 
of DFS drives it to select the most discriminative features and 
remove the redundant ones simultaneously. 

D. Comparison of DFS with Different p Values 

The value of $p$ balances the sparsity and convexity of the 
formulation of DFS. The closer $p$ is to 0, the 
sparser the representation is. In this subsection, we compare 
the performance of DFS with different $p$ values. Note that 
our goal in extending $\ell_{2,1}$-norm regularization to $\ell_{2,p}$-norm 
regularization is to find sparser solutions, so we only consider 
the cases $0 < p \le 1$, despite the fact that our algorithm 
is convergent for all $p \in (0, 2]$. In the experiments, the data sets 
ORL, USPS and ISOLET5 are employed. The value of $p$ is 
chosen from {0.001, 0.01, 0.1, 1}. Since DFS with different $p$ values 
may have different optimal regularization parameters, we tune 
this parameter for each $p$ and report the best results. Fig. 4 
shows the results. 
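
To illustrate why a smaller $p$ favors row sparsity, the sketch below evaluates the regularizer for a small matrix. It assumes the standard definition $\|A\|_{2,p}^p = \sum_i \|a^i\|_2^p$, where $a^i$ is the $i$-th row of $A$, which is taken here to match the formulation used for DFS; the function name and the example matrix are hypothetical.

```python
import math
from typing import Sequence


def l2p_norm_p(A: Sequence[Sequence[float]], p: float) -> float:
    """Return sum_i ||a^i||_2^p, i.e. the l_{2,p}-norm raised to the power p."""
    if p <= 0:
        raise ValueError("p must be positive")
    total = 0.0
    for row in A:
        row_norm = math.sqrt(sum(v * v for v in row))
        if row_norm > 0:  # all-zero rows contribute nothing
            total += row_norm ** p
    return total


# Hypothetical 3 x 2 matrix with row norms 5, 0 and 1.
A = [[3.0, 4.0], [0.0, 0.0], [0.6, 0.8]]
for p in (1, 0.1, 0.001):
    print(p, l2p_norm_p(A, p))
```

For $p = 1$ the value is the sum of the row norms (6 for this matrix). As $p$ approaches 0, every non-zero row contributes almost exactly 1, so the value approaches the number of non-zero rows (2 here), which is the quantity that feature selection ideally penalizes.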

For brevity, we denote DFS with $p = c$ 
as DFS($p = c$). From Fig. 4, we can see that DFS with 
different $p$ values does lead to different results. On the data 
set ORL, the results of DFS for $p$ = 0.001, 0.01, 0.1 and 
1 are very close; DFS($p = 1$) is slightly behind the others 
when more than 65 features are selected. On the data set 
USPS, DFS($p = 0.1$) outperforms DFS($p = 1$) except when 
the number of selected features is 10. DFS($p = 0.001$) and 
DFS($p = 0.01$) show an advantage over DFS($p = 1$) when 
fewer than 45 features are selected. On the ISOLET5 
data set, when $p$ is less than 1, the performance of DFS 
consistently surpasses that of DFS($p = 1$). 

Though a smaller $p$ value means a sparser representation, the 
classification accuracy does not monotonically increase when 
$p$ decreases, as the results show. A possible reason is that 
our proposed algorithm only guarantees a local optimum in the 
non-convex cases. Another reason may be that in the practical 
implementation of the algorithm, we may not manage to find 
the optimal value of the regularization parameter $\gamma$ for each $p$. 
On the other hand, in some cases, DFS with positive fractional 
$p$ values does find a better solution than with $p = 1$. This is 
clearly demonstrated by the results on the data set ISOLET5. 

E. Impact of $\gamma$ on the Performance of DFS 

The regularization parameter $\gamma$, which controls the 
trade-off between the criterion of LDA and the row sparsity of $A$, 
plays an important role in DFS for feature selection. In this 
subsection, we study how the performance of DFS is 
affected by different $\gamma$ values. Without loss of generality, we 
investigate the effect of $\gamma$ on DFS($p = 0.1$) and DFS($p = 1$) 
on the data sets ORL, COLON and ISOLET5. The value of $\gamma$ 
is set as (10-®, 0.01, 0.1, 1, 10, 100, 10“^, 10®} and 

the number of selected features varies from 10 to 100 with 
an interval of 10. The performance variation w.r.t. $\gamma$ and the 
number of selected features is shown in Fig. 5. 

As seen from Fig. 5, DFS($p = 1$) and DFS($p = 0.1$) show 
similar trends in performance variation w.r.t. $\gamma$ on each data set, 
but have different optimal $\gamma$ values. The degree to which DFS 
is affected by the value of $\gamma$ differs across these three 
data sets. From left to right, the performance variation w.r.t. 
$\gamma$ grows. As $\gamma$ increases, the classification accuracy 
first rises and then falls for both $p = 0.1$ and $p = 1$ on all 
data sets. When the number of selected features is small, the 
performance of DFS is more sensitive to $\gamma$. The performance 
variation caused by varying $\gamma$ is comparable with that caused 
by different numbers of selected features. 


F. Convergence Analysis and Time Comparison 

To validate the efficiency of our proposed algorithm for solving 
DFS, which involves $\ell_{2,p}$-norm minimization, we present the 
convergence curves of Algorithm 1 when $p$ = 0.1, 0.5, 1. 
Two kinds of results are provided. The first concerns the 
objective function and the other the divergence between two 
consecutive $A$s, as defined in (33). We show the results on 
the data sets COIL20 and COLON, since the algorithm has similar 
convergence behavior on the other data sets. The convergence 
curves are displayed in Fig. 6. 

As seen from Fig. 6, the objectives of DFS with $p$ = 
0.1, 0.5, 1 are non-increasing during the iterations, and they 
all converge to a fixed value. Additionally, in all cases, the 
divergence between two consecutive $A$s converges to zero, 
which indicates that the final results will not change 
drastically. Furthermore, DFS converges within 20 iterations 
on these two data sets for the three $p$ values. Therefore, our 
proposed DFS scales well in practice because of the fast 
convergence speed. 

We report the computational time of DFS($p = 1$) and 
the other five baseline methods on two representative data 
sets, COIL20 and ISOLET5. All the algorithms are tested on 
a laptop with 4 processors (2.27 GHz each) and 5.87 
GB of available RAM, using Matlab implementations¹⁰. 
The results are shown in Table V. As analyzed in 
Subsection B of Section V, eigen-decomposition is the most 
time-consuming operation of DFS and it is performed in 
each iteration, so DFS takes longer than all the baselines 
except BAHSIC. Similarly, BAHSIC 
involves iterations and needs to renew the data kernel matrix 
in each iteration, so it costs the most time in both cases. 


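TABLE V 

Computational Time Comparison on Data Sets COIL20 and 
ISOLET5 

| Data set | BAHSIC  | LS   | mRMR | RF    | TR   | DFS   |
|----------|---------|------|------|-------|------|-------|
| COIL20   | 1135.93 | 0.41 | 7.25 | 51.42 | 1.23 | 96.56 |
| ISOLET5  | 639.01  | 0.35 | 6.78 | 38.81 | 0.85 | 63.82 |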


VII. Conclusion and Future Work 

In this paper, a new formulation is proposed by combining 
LDA and sparsity regularization for feature selection. In 
particular, we extend the $\ell_{2,1}$-norm based formulation 
to the $\ell_{2,p}$-norm regularized case, providing more choices 
of $p$ to fit a variety of sparsity requirements. We derive 
an efficient algorithm to solve the $\ell_{2,p}$-norm minimization 
problem and prove that it monotonically 
decreases the objective until convergence when $0 < p \le 2$. 
Moreover, DFS is able to select 
the most discriminative features and remove the redundant 
ones simultaneously, which enables it to outperform competing 
feature selection methods. Experiments on various types of 
real-world data sets illustrate the advantages of our proposed 
method. 

There are several interesting directions to investigate in the 
future. First, we would like to find a better way of dealing 

¹⁰ The code of BAHSIC offered on the author’s website is written in Python. 
We have rewritten it in Matlab for a fair comparison. 











[Fig. 4 plots. Panels: (a) ORL, (b) USPS, (c) ISOLET5. Horizontal axis: number of selected features. Curves: p = 1, p = 0.1, p = 0.01, p = 0.001.] 

Fig. 4. Comparison between DFS with different p values w.r.t. classification accuracy, p = 0.001, 0.01, 0.1, 1. 





[Fig. 5 plots. Panels: (a) ORL, p = 0.1; (b) COLON, p = 0.1; (c) ISOLET5, p = 0.1; (d) ORL, p = 1; (e) COLON, p = 1; (f) ISOLET5, p = 1.] 

Fig. 5. Performance variation of DFS with p = 0.1 (top row) and p = 1 (bottom row) w.r.t. different values of the regularization parameter $\gamma$. 





[Fig. 6 plots. Recoverable panel titles: (g) COIL20, p = 0.1; (h) COIL20, p = 0.5.] 

Fig. 6. Convergence behavior of DFS on COIL20 (left side) and COLON (right side) respectively when p = 0.1, 0.5, 1. The top row is the objective value of 
DFS. The bottom row is the divergence between two consecutive $A$s measured by (33). 
with the singularity of the total scatter matrix $S_t$, which is 
addressed in this paper by regularization. Second, it is possible 
to extend this work to a kernel LDA version to deal with 
nonlinear tasks. Finally, deciding the values of parameters is 
still an open problem, which is unsolved in many algorithms. 